*=------------------------------------------------------------------------=*
>> Precalc Linear lookup Blitter C2P <<
Version: 1.04
Released: 27th-04-1996
Written & Designed By: Kevin Picone
(c) Copyright 1996 by Kevin Picone of Underware Design
All Rights Reserved.
*=------------------------------------------------------------------------=*
Contact: 'Kevin Picone' at Email: uwdesign@lin.cbl.com.au
*=------------------------------------------------------------------------=*
Features:
  * Linear Chunky Frame Buffer
  * Single, Double & Quad Pixel widths are supported
  * Uses Normal (non-resorted) Pixels
  * 256/64/16 Colour Versions
  * Various Specialized C2P methods:
      * Normal
      * Delta
      * Null Skip & Clear
      * Delta Null Skip & Clear (NEW to V1.04)
  * 16bit conversion with 32bit Writes to ChipRAM

Special Requirements:
  * ECS/AGA Blitter for long blits.
  * FastRAM for the Precalc buffer (from 512k -> 2meg ;)
  * Normal Planar Screen (NOT interleaved)
  * No Screen modulo is allowed

Disadvantages:
  * Rather Hungry on FastRAM
  * Extra ChipRAM demands
  * Uses the Blitter
  * Provided Sources are very bulky
  * Can be cumbersome to initially add
  * No 'C' support (sorry!)
*=------------------------------------------------------------------------=*
* Copyright:
============
The included source codes, intellectual properties, documentation and
their contained description of methods, remain the (c) copyright of
Kevin Picone 1996.
I hereby grant permission for this code to be used freely in either
P.D./Freeware/Shareware/commercial software releases or used as the basis
for a better C2P solution in PD/Freeware/Shareware/Commercial software
releases, with only one condition, you *MUST* please credit me for my
work.
This archive may be freely distributed via any means.
* Disclaimer:
=============
I in no way imply either directly or indirectly, that the described methods
or included source code(s) are the best (fastest) possible C2P solutions
in any/some/all cases or situations, they are provided as purely _optional_
methods to solve the c2p bottleneck.
* What is PLLB-C2P ?
====================
Precalc Linear Lookup Blitter - Chunky To Planar is actually a C2P
system and not just a few assorted routines. The system (although it was
originally designed purely for my own use) attempts to let the programmer
easily set up and support various bitplane depths, pixel widths and
conversion methods.
* What are 'Normal' 'Delta' & 'Nullskip' c2p routines ?
=======================================================
Pllbc2p has built-in support for various types of c2p conversion. These
methods are 'Normal', 'Delta', 'NullSkip' & 'DeltaNullSkip'. I've done
this for one rather obvious reason: it's quite common that specific
rendering algorithms can have their performance enhanced by using
customized c2p solutions. Hence, I've given you as many choices as I
possibly could. Actually I've probably gone overboard, but who cares. ;)
* Normal   - This method is the simplest type of C2P possible: each
             frame it converts the entire chunky frame buffer into
             planar.
             Normal C2P is probably best used when 80% -> 100% of
             your chunky frame buffer is going to change every frame.
* Delta    - Delta c2p is a little more complex. It takes two chunky
             frame buffers, the chunky frame just rendered and the
             last chunky frame rendered, and compares them, looking
             for differences. Each time a difference is found, it
             converts that group of pixels into planar.
             Delta c2p can be very powerful, not to mention quite
             fast in various situations. Personally, I'd recommend
             that you use it over 'normal' c2p, or at least do some
             performance tests for yourself.
             Those familiar with Delta C2P algos will notice that PLLB
             c2p allows for double buffering of your 'planar' image,
             i.e. you don't just render over the currently visible
             frame. (So no ugly frame cuts)
             Moreover, the Pllbc2p Delta routines work in delta fields
             of 16 pixels (or less, depending upon the pixel width)
             for improved delta frequency, and only shift data to
             ChipRAM when a difference is found and converted.
* NullSkip - Is a specialized C2P algorithm that scans for groups of
             NULL pixels (groups of pixel 0). Each time it locates
             a NULL pixel group, it performs a quick clear operation
             on the planar buffer and then continues on, instead of
             passing the null pixels through the c2p process.
             NullSkip C2P is probably best used when you can be sure
             that a large part of the chunky frame buffer is going to
             be constantly clear.
             NullSkip also clears your chunky buffer as it goes.
* DeltaNullSkip - Is just a logical extension to 'NullSkip', and NO, it's
             not a combination of the 'Delta' + 'NullSkip' methods.
             It's actually a rather clever way to reduce the ChipRAM
             access in the null skip algorithm, which should be very
             useful on high end CPUs like the 060 (not tested) and
             the 040, and probably even 50mhz 030 systems.
             It works by remembering whether the present group of NULL
             pixels was cleared in the planar buffer during the last
             c2p sweep or not. If it was, then it doesn't write to chip
             at all. If not, it performs the clear and sets the delta
             tag for this group of pixels. Hence, this reduces the
             amount of ChipRAM access and provides a small speed-up
             in the process.
             DeltaNullSkip also clears out the chunky buffer while it
             processes it.
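
The DeltaNullSkip bookkeeping described above can be modelled on a host
machine. The Python sketch below is not Amiga code and is not the routine
supplied in this archive. The group size of 4 pixels, the one byte tag per
group, and the resetting of a tag whenever a non-null group is converted
are all assumptions made purely for illustration.

```python
GROUP = 4  # assumed pixels per delta tag; the real group size is not stated


def delta_nullskip_pass(chunky: bytearray, tags: bytearray) -> dict:
    """Run one DeltaNullSkip sweep and count what happened to each group.

    chunky -- source chunky buffer, cleared in place as it is processed
    tags   -- one byte per group; non-zero means "already clear in planar"
    """
    stats = {"converted": 0, "cleared": 0, "skipped": 0}
    for g in range(len(chunky) // GROUP):
        start = g * GROUP
        if any(chunky[start:start + GROUP]):
            # Non-null pixels: these would go through the normal c2p path.
            stats["converted"] += 1
            tags[g] = 0  # assumption: the planar data is no longer clear
            chunky[start:start + GROUP] = bytes(GROUP)
        elif tags[g]:
            # Null group already cleared by an earlier sweep: no chip write.
            stats["skipped"] += 1
        else:
            # Null group not yet cleared: clear planar once and remember it.
            stats["cleared"] += 1
            tags[g] = 1
    return stats


if __name__ == "__main__":
    frame = bytearray([0, 0, 0, 0, 1, 2, 3, 4])
    tags = bytearray(len(frame) // GROUP)
    for sweep in range(3):
        print(sweep, delta_nullskip_pass(frame, tags))
```

In this model only the "converted" and "cleared" counts stand for ChipRAM
writes, so a frame that stays mostly empty settles down to "skipped".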
* I don't understand this Pixelwidth stuff ?
============================================
Pllbc2p, like most other C2P routines, supports various pixel widths,
including single, double and even quad pixel modes.
In double & quad pixel mode pllbc2p allows you to render a half or
quarter sized (width) chunky image, and the c2p routine will automatically
blow it up to your selected screen width. This can greatly improve
both your own algorithm's performance and the c2p routine's performance.
Pllbc2p doesn't support pixel height modes, you'll have to set those up
yourself. I normally just use the copper to blur the scan lines.
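
As a rough illustration of the pixel width relationship only, here is a
small host-side Python sketch. The function names are hypothetical, and
the real routines do this expansion as part of the c2p pass rather than as
a separate copy step.

```python
def chunky_width(screen_width: int, pixel_width: int) -> int:
    """Width of the chunky buffer needed for a given screen width."""
    if pixel_width not in (1, 2, 4):
        raise ValueError("pixel width must be 1 (single), 2 (double) or 4 (quad)")
    return screen_width // pixel_width


def expand_row(row: bytes, pixel_width: int) -> bytes:
    """Repeat every chunky pixel pixel_width times, as seen on screen."""
    return bytes(p for p in row for _ in range(pixel_width))


if __name__ == "__main__":
    print(chunky_width(320, 2))                      # chunky pixels per line
    print(expand_row(bytes([1, 2, 3]), 4).hex(" "))  # quad pixel expansion
```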
* The Bizarre Method:
=====================
The following method outline is based upon the 256 colour single pixel
width version; other c2p modes may and do vary.
PLLB-C2P is a 16bit combination of a processor sweep / lookup array and a
Blitter resort algorithm. It is really only designed for 020/030 and
ECS/AGA systems, but it can also be fairly useful on 040 processors.
NOTE: PLLB-C2P performance on 040's is largely untested, and it has never
===== been tested on 060's AFAIK. But, if the right conditions present
      themselves, I would suspect that the DELTA'd versions and
      the new DELTANULLSKIP routines could be of *SOME* use in avoiding
      the infamous a4000/040 to ChipRAM bottleneck.
The basic idea behind the algorithm is the usage of large (sometimes
very large) precalculated pixel combination tables. These tables give us
the ability to use the chunky input pixels directly as lookup array
indexes into the precalculated conversion table. The chunky lookup table
contains all the possible combinations of two 8bit chunky pixels in their
unrolled planar format. (I.e. 8 bytes, one per bitplane...)
Examples. (256 colour, Pixel width of 1)

  (p=plane)              p0  p1  p2  p3  p4  p5  p6  p7
  Chunky Input $00,$0f = $01,$01,$01,$01,$00,$00,$00,$00
  Chunky Input $0f,$00 = $02,$02,$02,$02,$00,$00,$00,$00
  Chunky Input $01,$0f = $03,$01,$01,$01,$00,$00,$00,$00
  Chunky Input $88,$55 = $01,$00,$01,$02,$01,$00,$01,$02
  Chunky Input $ff,$0f = $03,$03,$03,$03,$02,$02,$02,$02
Obviously, with a little simple maths you'll quickly notice that this
means the lookup array for two 8bit chunky pixels is (256*256*8) = 512K ;)
and the precalc buffer can be as large as 2 megabytes, if available.
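
To make the table layout concrete, the following host-side Python sketch
(not Amiga code, and not the table generator from this archive) rebuilds
a single 8 byte entry from a pair of chunky pixels. The function names are
made up for this example. The bit layout is taken from the examples above:
byte p holds bit p of the first pixel in bit 1 and bit p of the second
pixel in bit 0.

```python
def planar_entry(pixel_a: int, pixel_b: int) -> bytes:
    """Return the 8 byte table entry (planes p0..p7) for two chunky pixels."""
    return bytes(
        (((pixel_a >> p) & 1) << 1) | ((pixel_b >> p) & 1)
        for p in range(8)
    )


def table_size_bytes(entry_bytes: int = 8) -> int:
    """Size of one table: every pair of 8bit pixels times the entry width."""
    return 256 * 256 * entry_bytes


if __name__ == "__main__":
    pairs = [(0x00, 0x0F), (0x0F, 0x00), (0x01, 0x0F), (0x88, 0x55), (0xFF, 0x0F)]
    for a, b in pairs:
        entry = ",".join(f"${v:02x}" for v in planar_entry(a, b))
        print(f"Chunky Input ${a:02x},${b:02x} = {entry}")
    print(table_size_bytes() // 1024, "K per table")
```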
Since the tables are unrolled into precalced groups of 8 bytes (2 longwords),
the idea is to always MOVE/logical OR from these tables in longwords,
into our temporary data registers on the CPU.
So, to overview the conversion process of 16 chunky pixels into planar,
we've got the following simple routine.

Convert the first 8 pixels, Pixels 0->7

; A0 = pointer to Chunky pixel buffer
; A5 = Base pointer to the Precalc comb's buffer
; NOTE: the index registers are used as longwords (.l), since a pixel
;       pair * 8 does not fit in a sign extended 16bit index.

; Process 8 Chunky pixels into planar
    clr.l   d1              ; Clear out d1
    move.l  (a0)+,d0        ; Move bytes ABCD from chunky buffer
    move.w  d0,d1           ; move bytes CD into d1
    clr.w   d0              ; AB-- clear low word of D0
    swap    d0              ; --AB exchange upper word for lower word
    lsl.l   #3,d0           ; mult AB * 8 (width of precalc buffer)
    move.l  (a5,d0.l),d3    ; grab planes 1,2,3,4 for pixels AB
    move.l  4(a5,d0.l),d4   ; grab planes 5,6,7,8 for pixels AB
    lsl.l   #2,d3           ; shift planes 1,2,3,4 up 2 bits
    lsl.l   #2,d4           ; shift planes 5,6,7,8 up 2 bits
    lsl.l   #3,d1           ; mult CD * 8 (width of precalc buffer)
    or.l    (a5,d1.l),d3    ; or on planes 1,2,3,4 for pixels CD
    or.l    4(a5,d1.l),d4   ; or on planes 5,6,7,8 for pixels CD
    lsl.l   #2,d3           ; shift planes 1,2,3,4 up 2 bits
    lsl.l   #2,d4           ; shift planes 5,6,7,8 up 2 bits
                            ; (first 4 pixels are processed at this point)
    clr.l   d1              ; Clear out d1
    move.l  (a0)+,d0        ; Move pixels EFGH from chunky buffer into d0
    move.w  d0,d1           ; move pixels GH into d1
    clr.w   d0              ; EF-- clear low word of D0
    swap    d0              ; --EF exchange upper word for lower word
    lsl.l   #3,d0           ; mult EF * 8 (width of precalc buffer)
    or.l    (a5,d0.l),d3    ; or on planes 1,2,3,4 for pixels EF
    or.l    4(a5,d0.l),d4   ; or on planes 5,6,7,8 for pixels EF
    lsl.l   #2,d3           ; shift planes 1,2,3,4 up 2 bits
    lsl.l   #2,d4           ; shift planes 5,6,7,8 up 2 bits
    lsl.l   #3,d1           ; mult GH * 8 (width of precalc buffer)
    or.l    (a5,d1.l),d3    ; or on planes 1,2,3,4 for pixels GH
    or.l    4(a5,d1.l),d4   ; or on planes 5,6,7,8 for pixels GH

Result, bit 31 -> bit 0 (a1 = the plane 1 bit of pixel A, etc.):

d3 = a1b1c1d1e1f1g1h1a2b2c2d2e2f2g2h2a3b3c3d3e3f3g3h3a4b4c4d4e4f4g4h4
d4 = a5b5c5d5e5f5g5h5a6b6c6d6e6f6g6h6a7b7c7d7e7f7g7h7a8b8c8d8e8f8g8h8
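
The register layout above can be checked with a small host-side model.
The Python sketch below is not the Amiga routine itself: it computes each
table entry on the fly instead of reading a precalculated table, and the
function names are made up for this example. It mirrors the MOVE, shift
and OR sequence for 8 pixels and returns the two 32bit results.

```python
def planar_entry(pixel_a: int, pixel_b: int) -> bytes:
    """8 byte entry: byte p = (bit p of pixel_a << 1) | bit p of pixel_b."""
    return bytes(
        (((pixel_a >> p) & 1) << 1) | ((pixel_b >> p) & 1)
        for p in range(8)
    )


def convert8(pixels):
    """Model of the 8 pixel sweep: returns (d3, d4) as 32bit integers.

    d3 holds planes 1,2,3,4 and d4 holds planes 5,6,7,8, one byte per plane,
    with the first pixel in the most significant bit of each byte.
    """
    d3 = d4 = 0
    for i in range(0, 8, 2):
        entry = planar_entry(pixels[i], pixels[i + 1])
        low = int.from_bytes(entry[:4], "big")   # planes 1,2,3,4
        high = int.from_bytes(entry[4:], "big")  # planes 5,6,7,8
        # Shifting before each OR matches "move, lsl #2, or, lsl #2, or ..."
        d3 = ((d3 << 2) | low) & 0xFFFFFFFF
        d4 = ((d4 << 2) | high) & 0xFFFFFFFF
    return d3, d4


if __name__ == "__main__":
    d3, d4 = convert8([0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0])
    print(f"d3 = ${d3:08x}  d4 = ${d4:08x}")
```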
Above, I mentioned that the precalc array could be as large as 2 megabytes
of FASTRAM, which is (unfortunately) true. The reason for this is a
simple speedup which enables us to remove the two 'lsl.l #2,d3' &
'lsl.l #2,d4' instructions required after each 'MOVE/logical OR'
completely from the C2P loop. The idea is to create 4 versions of the
Precalced Array (4 * 512K = 2 meg), each one already shifted into
position. Cumbersome, I know, but it can make the routine(s) much faster....
Note: The Pllb-C2P system handles all the possible buffer combinations
===== by itself.
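
To show why four pre-shifted tables make the shifts unnecessary, here is a
host-side Python sketch (not Amiga code). It assumes that the four tables
are shifted by 6, 4, 2 and 0 bits for the 1st, 2nd, 3rd and 4th pixel
pair, which is what the shift-and-OR loop implies. For brevity the entries
are computed on demand rather than stored, and all names are made up.

```python
def entry_longs(pixel_a: int, pixel_b: int, shift: int = 0):
    """Planes 1-4 and planes 5-8 longwords for a pixel pair.

    Each plane byte holds the two pixel bits, pre-shifted left by shift.
    """
    entry = bytes(
        ((((pixel_a >> p) & 1) << 1) | ((pixel_b >> p) & 1)) << shift
        for p in range(8)
    )
    return int.from_bytes(entry[:4], "big"), int.from_bytes(entry[4:], "big")


def convert8_shifting(pixels):
    """One table: shift the accumulators by 2 bits before each OR."""
    d3 = d4 = 0
    for i in range(0, 8, 2):
        low, high = entry_longs(pixels[i], pixels[i + 1])
        d3 = ((d3 << 2) | low) & 0xFFFFFFFF
        d4 = ((d4 << 2) | high) & 0xFFFFFFFF
    return d3, d4


def convert8_preshifted(pixels):
    """Four tables: every pair lands in position, so only ORs are needed."""
    d3 = d4 = 0
    for i, shift in zip(range(0, 8, 2), (6, 4, 2, 0)):
        low, high = entry_longs(pixels[i], pixels[i + 1], shift)
        d3 |= low
        d4 |= high
    return d3, d4


if __name__ == "__main__":
    sample = [0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0]
    print(convert8_shifting(sample) == convert8_preshifted(sample))
    print(4 * 256 * 256 * 8 // (1024 * 1024), "meg for all four tables")
```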
Anyway, I also stated above that this is a 16bit routine (which
is needed for the blitter resort pass), so after we've processed
the first 8 chunky pixels into planar, we repeat the process,
this time storing the converted planes of pixels 8 to 15
in, say, registers d5,d6...
Now all that's needed is a simple merge process, and then to move the
merged data into CHIP for the Blitter resort pass.
; a1 = output buffer in CHIPRAM
; d2 = $ff00ff00
; d3 = planes 1,2,3,4 pixels 0-7
; d4 = planes 5,6,7,8 pixels 0-7
; d5 = planes 1,2,3,4 pixels 8-15
; d6 = planes 5,6,7,8 pixels 8-15

; resort the planar bytes into words
    move.l  d3,d0           ; move planes 1,2,3,4 to d0 (first 8 pixels)
    move.l  d5,d1           ; move planes 1,2,3,4 to d1 (second 8 pixels)
    and.l   d2,d0           ; mask with $ff00ff00, d0 = planes 1,-,3,-
    and.l   d2,d1           ; mask with $ff00ff00, d1 = planes 1,-,3,-
                            ; (second 8 pixels)
    eor.l   d0,d3           ; mask off planes 1,-,3,- from planes 1,2,3,4
                            ; d3 = -,2,-,4
    eor.l   d1,d5           ; mask off planes 1,-,3,- from planes 1,2,3,4
                            ; d5 = -,2,-,4 second 8 pixels
    lsr.l   #8,d1           ; d1 = -,1,-,3 .. second 8 pixels
    or.l    d1,d0           ; d0 = 1,1,3,3
    move.l  d0,(a1)+        ; Dump out 32bits (16bits each of planes 1 & 3)
    lsl.l   #8,d3           ; d3 = 2,-,4,-
    or.l    d5,d3           ; d3 = 2,2,4,4
    move.l  d3,(a1)+        ; Dump out 32bits (16bits each of planes 2 & 4)

(then repeat the process for planes 5,6,7,8, using d4 & d6)
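
The merge step is easy to trace with real numbers. This host-side Python
sketch (not Amiga code, function name made up) follows the and/eor/shift/or
sequence above on two 32bit values and returns the two longwords that
would be written to ChipRAM.

```python
MASK = 0xFF00FF00  # selects the 1st and 3rd plane byte of a longword


def merge_planes(first8: int, second8: int):
    """Merge two 'planes 1,2,3,4' longwords into two word-wide longwords.

    first8  -- plane bytes 1,2,3,4 for pixels 0-7
    second8 -- plane bytes 1,2,3,4 for pixels 8-15
    Returns (planes 1,1,3,3 and planes 2,2,4,4), each plane now 16 pixels.
    """
    d0 = first8 & MASK          # 1,-,3,-  (first 8 pixels)
    d1 = second8 & MASK         # 1,-,3,-  (second 8 pixels)
    d3 = first8 ^ d0            # -,2,-,4  (first 8 pixels)
    d5 = second8 ^ d1           # -,2,-,4  (second 8 pixels)
    d0 |= d1 >> 8               # 1,1,3,3
    d3 = ((d3 << 8) & 0xFFFFFFFF) | d5  # 2,2,4,4
    return d0, d3


if __name__ == "__main__":
    out_a, out_b = merge_planes(0x11223344, 0xAABBCCDD)
    print(f"${out_a:08x} ${out_b:08x}")
```

With plane bytes $11,$22,$33,$44 for the first 8 pixels and
$aa,$bb,$cc,$dd for the second 8, the two output longwords pair up each
plane's bytes into one 16 pixel word.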
One thing you've no doubt already noticed is that with this method you're
able to write to CHIPRAM in longwords without a great amount of fuss...
Plus, the routine(s) are small enough to fit nicely within the 68020's
small instruction cache, which should help a great deal. In theory
pllbc2p on a 28mhz 020 should perform as if it were running on a 25mhz
030, though obviously various conditions might hinder this.
Unfortunately, we still need to use the blitter to move/resort the mixed
processed display (which is now in ChipRAM) from the Temporary Image buffer
to its final resting place, the screen. To avoid having to mess with the
blitter constantly, the final screen(s) need to be defined as normal planar
and *NOT* interleaved, as this allows us to blit move an entire bitplane's
worth of pixels per usage of the blitter.
Hence, a simple hardware bashing routine to Move/Resort a single bitplane
from the temporary image buffer to the display screen would look something
like this.
Blitter_Moveresort_TempImageBitplane_to_Display:     ; (NICE LABEL ;)
; d0 = source plane offset within the temporary image buffer
; d1 = dest plane offset within the display buffer
    lea.l   $dff000,a6
    jsr     waitblitter                 ; wait for Blitter ;)
    move.l  #$09f00000,$40(a6)          ; BLTCON0/1: minterms and chan's (A->D copy)
    move.l  #$ffffffff,$44(a6)          ; BLTAFWM/BLTALWM: init masks
    move.l  #((16-2)*$10000),$64(a6)    ; BLTAMOD=14 (source), BLTDMOD=0 (dest)
    lea.l   BlitterTemp_ImageBuffer,a0
    move.l  Display_buffer,a1
    add.l   d0,a0                       ; add on any source plane offset
    add.l   d1,a1                       ; add on any dest plane offset
    move.l  a0,$50(a6)                  ; BLTAPT: Chan A pointer to source buf
    move.l  a1,$54(a6)                  ; BLTDPT: Chan D pointer to dest buf
    move.w  #20*256,$5c(a6)             ; BLTSIZV: (20 words wide * 256 lines in height)
    move.w  #1,$5e(a6)                  ; BLTSIZH: init width (1 word) and START!
    rts
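
What the blit above does to memory can be modelled on a host machine.
This Python sketch is not Amiga code. It assumes the temporary image
buffer holds 16 bytes per 16 pixels, with the plane words in the order
1,3,2,4 as written by the merge routine, followed by 5,7,6,8 (the order of
the second half is an assumption by analogy). All names are made up.

```python
WORD_ORDER = (1, 3, 2, 4, 5, 7, 6, 8)  # assumed plane order per 16 pixels
GROUP_BYTES = 16                       # 8 planes * one 16bit word
SOURCE_MODULO = GROUP_BYTES - 2        # the (16-2) written to BLTAMOD


def source_offset(plane: int) -> int:
    """Byte offset of a plane's word inside each group (d0 in the routine)."""
    return WORD_ORDER.index(plane) * 2


def blit_rows(width: int, height: int) -> int:
    """Blit height in 'lines' when every line is a single word wide."""
    return (width // 16) * height


def blit_plane(temp: bytes, plane: int, rows: int) -> bytes:
    """Copy one word per row from temp, skipping SOURCE_MODULO bytes."""
    out = bytearray()
    pos = source_offset(plane)
    for _ in range(rows):
        out += temp[pos:pos + 2]      # 1 word wide, destination modulo 0
        pos += 2 + SOURCE_MODULO
    return bytes(out)


if __name__ == "__main__":
    # Two groups; each word is tagged as (group number, plane number).
    temp = b"".join(bytes([g, p]) for g in range(2) for p in WORD_ORDER)
    print(blit_plane(temp, 2, 2).hex(" "))
    print(blit_rows(320, 256), "rows for a 320*256 screen")
```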
The basic idea is to work out some way (depends on what you're doing really)
that allows you to trigger the blitter and then just forget about it,
which could be via, say, Async blits, Interrupts, Copper or perhaps even
another Task or something. Personally, I just found it simpler (in the
Tmap engine it was originally designed for) to check its progress at
preselected times, ie. between mapping textures, objects, floors etc etc..
which added minimal (if any really) overhead, and seemed to work just
fine.. ;)..... lazy I know ;)
* What Memory Is needed ?:
==========================
Well, again this depends upon the type of game/demo or perhaps util you're
trying to develop.
Since I've included Normal, Delta, Nullskip & now Delta Nullskip versions
of each routine, your buffer requirements will obviously differ.
* NORMAL C2P
In FASTRAM, all that you require (min) is to first allocate your
source CHUNKY PIXEL buffer and then the number of Precalc tables you wish
to use: 1->4, ie. 512k->2meg.
In CHIPRAM, you'll need to allocate the temporary image buffer, ie.
((ScreenWidthInPixels/8) * ScreenHeight * bitplanes) = size of the
temporary image buffer. So if you were originally using only two displays
for double buffering, you'll now need an extra display for the
Temporary Image buffer. (sorry)
* DELTA C2P
The only different requirement for using the delta versions, is that
you'll need a second CHUNKY PIXEL buffer. It should also be noted,
that you have to swap chunky pixel buffer pointers after each frame is
rendered, just as you would do normally for double/triple or even
Quad buffering purposes.
* NULL SKIP C2P
The Nullskip versions have the same memory requirements as the Normal
C2P versions...
* DELTA NULLSKIP C2P
Delta Nullskip has the same memory requirements as the Normal c2p version,
except it also needs a special 'delta' buffer so it can remember which
groups of null pixels it's cleared in the planar buffer.
This buffer needs to be in FASTRAM and at least the size of your chunky
buffer divided by 4.
Ie. (ChunkyScreenwidth*ChunkyScreenheight)/4
PLEASE NOTE: Neither the Normal nor the Delta versions of the routine CLEAR
the source Chunky Frame Buffer.
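
As a worked example of the requirements above, this host-side Python
sketch adds up the buffers for the 256 colour, single pixel width case
only. The function and key names are made up, the 512K table size is the
256 colour figure quoted earlier, and other colour depths and pixel widths
are deliberately not covered.

```python
TABLE_BYTES = 256 * 256 * 8  # one 256 colour precalc table (512K)


def memory_requirements(width, height, method="normal", tables=1, planes=8):
    """Return FASTRAM and CHIPRAM byte counts for one c2p method.

    method -- 'normal', 'delta', 'nullskip' or 'deltanullskip'
    tables -- number of precalc tables in use (1 -> 4)
    """
    if method not in ("normal", "delta", "nullskip", "deltanullskip"):
        raise ValueError("unknown c2p method")
    if not 1 <= tables <= 4:
        raise ValueError("tables must be 1 -> 4")
    chunky = width * height               # one byte per chunky pixel
    fast = chunky + tables * TABLE_BYTES  # chunky buffer + precalc tables
    if method == "delta":
        fast += chunky                    # second chunky pixel buffer
    elif method == "deltanullskip":
        fast += chunky // 4               # the special 'delta' buffer
    chip = (width // 8) * height * planes  # temporary image buffer
    return {"fast": fast, "chip_temp": chip}


if __name__ == "__main__":
    for name in ("normal", "delta", "nullskip", "deltanullskip"):
        print(name, memory_requirements(320, 256, name))
```

For a 320*256 screen in 8 bitplanes the temporary image buffer comes to
81920 bytes of CHIPRAM, the same size as one display.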
* Linear Frame Buffer:
======================
After messing with various Tmapped & Gouraud Shaded engines (just like
everybody else), where the demand for a chunky frame buffer is high, the
one rather obvious thing I found is that it's _normally_ much better to
save a couple of cycles per chunky pixel, by rendering / copying across
a linear buffer, than to save a couple of cycles in the C2P loop. This
is particularly true when you're constantly rendering a full frame
buffer of pixels and pixel (texture/gouraud) overlap (ie. polygons that
share the same screen space) is high.
Pllb-c2p is probably quite unique here: since it uses Precalc tables to
perform the actual bit shifting process, we obtain a free linear buffer
and also achieve slightly faster C2P in the process.
* A possible C2P Hint for 'Wolf 3D' or 'Doom' texture mapped engines:
=====================================================================
Without wanting to sound critical of *ANY* of the presently available
Tmap Engines (having written a 'Blake Stone' styled engine myself, I do
know just how hard the task actually is), the one thing I noticed
is the lack of specialized C2P for when floor/ceiling texture mapping isn't
present. Personally, I consider this the ideal time for either a delta'd
C2P routine or a specialized Null SKIP/CLEAR (you could make it use a
pattern) routine. Using the latter method, I've been able to obtain
8->9fps refresh updates on just an a1200/020 14mhz+fastram, with a screen
size of 320*256, 256 colour, 1x*2y pixel res, while the engine includes
lightsources, Solid & See through Texture Mapped Walls & 2D mapped
objects.
Also, it becomes pretty obvious that for a Delta'd C2P routine to be all
that useful while Floor/Ceiling Mapping is present, the artist
should probably think about making the floor & ceiling (and even
wall, where possible) textures a little less complex, which will enhance
performance greatly. It might also be a nice option to offer full or reduced
detail floors & ceilings, instead of turning them off completely.
* Why supply Quad Pixel width versions ?:
=========================================
Well, what can I say, they give you an optional 'Copper' styled screen
resolution, but without the loss of 24bit colour. (not that it matters
too much at this res ;) Perhaps they can be best used in effects like
the common 'Fire' / 'Water' & 'Life' effects.
* Some Future Ideas:
====================
* Auto 8bit to 6bit/4bit remapping. (very possible)
* Ham 8 with auto interpolation. (crazy idea at the moment ;)
* That's it from me:
====================
Well, hopefully the enclosed routine(s) are of some use to you, or perhaps
they might inspire you to take this method further than I have, as this
is really only the initial working version(s) of PLLB-C2P, so I've *NO*
doubt there's any number of possible speed-ups just waiting to be found.
If you do bother to enhance any of the pllbc2p routines, well, I'd
appreciate it greatly if you'd let me know, so I can pass this
information to others in the next release of pllbc2p.
Cya,
Kevin Picone
Underware Design
*=----------------------------------------------------------------------=*
T H E E N D
*=----------------------------------------------------------------------=*